fix(loadgen): h1 read-EOF backoff + h2 frame-size/flow-control + h2 reconnect (full-matrix resilience) by FumingPower3925 · Pull Request #63 · goceleris/loadgen

FumingPower3925 · 2026-06-21T10:06:37Z

Three resilience fixes so the full benchmark matrix survives servers it previously DNF'd on. Each is the loadgen side of a column that produced zero requests in the last full run.

1. h1client: backoff-paced reconnect on read EOF

The write-error path already reconnected + backed off; the read-status and read-header EOF paths returned the bare error. A Connection: close server — or one closing mid-response under churn-close — surfaces as a read EOF, so a close-after-one-response server spun read-EOFs with no pacing and never recovered. drogon collapsed to 0 successful requests (churn-close cell) from exactly this.

2. h2client: honor server MAX_FRAME_SIZE + send-window flow control

POST bodies > 16384 B were sent as a single oversized DATA frame (FRAME_SIZE_ERROR), and bodies larger than the 65535 B initial send window overran flow control. This is the post-64k-h2 failure, which hit 5 h2 columns: aspnet-h2, axum-h2, elysia-h2, hono-h2, hyper-h2. Now: capture the server's SETTINGS_MAX_FRAME_SIZE + SETTINGS_INITIAL_WINDOW_SIZE at handshake, split at the frame size, and pace against the connection + per-stream send windows, replenished from WINDOW_UPDATE. Regression test posts 200000 B through a strict 16384/65535 h2c server; pre-fix single-frame send fails it with GOAWAY code=6.
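To make the frame-size arithmetic concrete, here is a minimal sketch of splitting a body at the server's advertised frame size. `splitFrames` is a hypothetical helper written for illustration, not the function in this PR; it only shows the chunking, not the flow-control pacing.

```go
package main

import "fmt"

// splitFrames cuts body into chunks of at most maxFrame bytes, one chunk per
// DATA frame. Hypothetical helper; the real client writes frames directly.
func splitFrames(body []byte, maxFrame int) [][]byte {
	if maxFrame <= 0 {
		return nil
	}
	var frames [][]byte
	for len(body) > 0 {
		n := len(body)
		if n > maxFrame {
			n = maxFrame
		}
		frames = append(frames, body[:n])
		body = body[n:]
	}
	return frames
}

func main() {
	// Same sizes as the regression test: 200000 B body, 16384 B max frame.
	frames := splitFrames(make([]byte, 200000), 16384)
	fmt.Println(len(frames), len(frames[len(frames)-1])) // prints "13 3392"
}
```

The 200000 B regression body therefore needs 12 full frames plus one 3392 B tail frame, where the pre-fix client sent a single 200000 B frame.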

3. h2client: re-dial connections the server closes/GOAWAYs mid-cell

The h2 client dialed once in New() and never recovered a torn-down conn. A server that GOAWAYs/closes connections periodically — hypercorn does, so fastapi-h2 hit it every cell: ~1.1 billion errors / 0 requests per 35 s cell — left the slot dead and every DoRequest hot-looped the closed-conn error (the h2 analog of #1). Each conn is now an h2ConnSlot (atomic.Pointer + single-flight redial + backoff); DoRequest re-dials a dead slot and swaps the fresh conn in lock-free.

Verification

Full suite green on x/net 0.56.0, including under -race.
New regression tests for the flow-control split and the reconnect path.
Live hypercorn h2c repro (the fastapi-h2 server): GET 0 → 47.6k req (0.4% err), POST-65536-body 0 → 16k req — both previously zero.

A Connection:close server (or one closing mid-response under churn-close) surfaces as a read EOF on the status line or a header line, not just on the write. The write-error path already reconnects + backs off; the two read paths returned the bare error, so a close-after-one-response server spun read-EOFs with no pacing and never re-established a usable conn. drogon collapsed to 0 successful requests from exactly this. Mirror the write path: reconnect for the next request, recordConnectError + backoff only when the server is genuinely down, otherwise reset the backoff.

POST bodies larger than 16384 B were sent as a single oversized DATA frame (FRAME_SIZE_ERROR against a 16384-default server) and bodies larger than the 65535 initial send window overran flow control (FLOW_CONTROL_ERROR / hang to the 5-min deadline). This is the post-64k-h2 failure (64 KiB body = 65536 B, one past the window). Capture the server's SETTINGS_MAX_FRAME_SIZE and SETTINGS_INITIAL_WINDOW_SIZE at handshake; split the body at the server's frame size and pace it against the connection (RFC 7540 §6.9.2: starts at 65535) and per-stream send windows, replenished from WINDOW_UPDATE in readLoop. The writer goroutine is sequential, so the active stream's window is tracked by curStreamID/curStreamWindow. Drop the now-dead h2WriteReq.maxFrame field. Regression test posts a 200000-B body through a strict h2c server advertising 16384/65535 (matching real bench targets, not x/net's lenient 1 MiB defaults); the pre-fix single-frame send fails it with GOAWAY code=6.
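The pacing rule above amounts to: each DATA frame carries the minimum of the remaining body, the server's max frame size, the connection window, and the stream window. The sketch below models that with a hypothetical `sendWindows` type; it is not the PR's writer goroutine, and the WINDOW_UPDATE is simulated inline rather than delivered by a read loop.

```go
package main

import "fmt"

// sendWindows is a hypothetical stand-in for the client's send-side state.
type sendWindows struct {
	maxFrame int   // server's SETTINGS_MAX_FRAME_SIZE
	conn     int64 // connection-level send window
	stream   int64 // send window of the stream currently being written
}

// take returns how many bytes the next DATA frame may carry and debits both
// windows. A result of 0 means the writer must wait for WINDOW_UPDATE.
func (w *sendWindows) take(remaining int) int {
	n := int64(remaining)
	if m := int64(w.maxFrame); n > m {
		n = m
	}
	if n > w.conn {
		n = w.conn
	}
	if n > w.stream {
		n = w.stream
	}
	if n <= 0 {
		return 0
	}
	w.conn -= n
	w.stream -= n
	return int(n)
}

// credit applies WINDOW_UPDATE increments to the two windows.
func (w *sendWindows) credit(connInc, streamInc int64) {
	w.conn += connInc
	w.stream += streamInc
}

func main() {
	w := &sendWindows{maxFrame: 16384, conn: 65535, stream: 65535}
	remaining := 65536 // the 64 KiB body from the failing cell
	for remaining > 0 {
		n := w.take(remaining)
		if n == 0 {
			fmt.Println("blocked with", remaining, "byte(s) left")
			w.credit(65535, 65535) // simulated WINDOW_UPDATE
			continue
		}
		fmt.Println("DATA frame of", n, "bytes")
		remaining -= n
	}
}
```

With these numbers the body goes out as three 16384 B frames and one 16383 B frame, then blocks with exactly 1 byte left until the windows are replenished, which is the "one past the window" case the commit message describes.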

The h2 client dialed its connections once in New() and never recovered one the server tore down. A server that GOAWAYs or closes connections periodically — hypercorn does, so the fastapi-h2 column hit it on every cell — left the slot permanently dead: readLoop marks the conn closed on GOAWAY and returns, then every DoRequest returns the bare closed-conn error with no pacing. fastapi-h2 logged ~1.1 billion errors / 0 successful requests per 35 s cell from exactly this hot loop (the h2 analog of the h1 churn-close bug). Wrap each connection in an h2ConnSlot (atomic.Pointer to the live conn + single-flight redial mutex + connectBackoff). DoRequest re-dials a dead slot, paced by the slot backoff, swapping the fresh conn in atomically so sibling workers pick it up lock-free. Verified against a live hypercorn h2c server: GET 0 -> 47.6k req (0.4% error), POST-65536-body 0 -> 16k req — both previously zero. Regression test closes a live conn out from under the client and asserts the next requests recover against the still-running server.
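A rough sketch of the slot pattern described above: an `atomic.Pointer` for the lock-free fast path and a mutex so that only one worker re-dials a dead connection. The `conn` and `connSlot` types and the injected `dial` function are placeholders for illustration (the PR's type is `h2ConnSlot`, whose fields are not shown here), and the backoff pacing is omitted to keep the example short. It assumes Go 1.19 or newer for `atomic.Pointer` and `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// conn is a placeholder for an h2 connection; closed is set on GOAWAY/close.
type conn struct {
	id     int
	closed atomic.Bool
}

// connSlot holds the live conn for one slot. Illustrative only.
type connSlot struct {
	live atomic.Pointer[conn]
	mu   sync.Mutex // single-flight: serializes redials
	dial func() (*conn, error)
}

// get returns a usable conn, re-dialing if the current one is dead.
func (s *connSlot) get() (*conn, error) {
	if c := s.live.Load(); c != nil && !c.closed.Load() {
		return c, nil // fast path: no lock taken
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	// Re-check: another worker may have redialed while we waited.
	if c := s.live.Load(); c != nil && !c.closed.Load() {
		return c, nil
	}
	c, err := s.dial()
	if err != nil {
		return nil, err
	}
	s.live.Store(c) // siblings see the fresh conn on their next Load
	return c, nil
}

func main() {
	dials := 0 // only touched inside dial, which runs under s.mu
	s := &connSlot{dial: func() (*conn, error) {
		dials++
		return &conn{id: dials}, nil
	}}
	first, _ := s.get()
	first.closed.Store(true) // simulate the server tearing the conn down

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.get()
		}()
	}
	wg.Wait()
	fmt.Println("dials:", dials) // prints "dials: 2"
}
```

Eight workers hitting the dead slot trigger one redial, not eight; the double check under the mutex is what prevents a dial stampede.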

FumingPower3925 added 2 commits June 21, 2026 12:06

FumingPower3925 force-pushed the fix/h2-flowctl-and-h1-churn-backoff branch from c30c81e to a9fb908 Compare June 21, 2026 10:11

FumingPower3925 changed the title ~~fix(loadgen): h1 churn-backoff on read EOF + h2 frame-size/flow-control (post-64k-h2)~~ fix(loadgen): h1 read-EOF backoff + h2 frame-size/flow-control + h2 reconnect (full-matrix resilience) Jun 21, 2026

FumingPower3925 merged commit 7328187 into main Jun 21, 2026
3 checks passed

FumingPower3925 deleted the fix/h2-flowctl-and-h1-churn-backoff branch June 21, 2026 15:43

FumingPower3925 merged 3 commits into main from fix/h2-flowctl-and-h1-churn-backoff
